# run the install command if did not do it before
# install.packages('remotes')
# remotes::install_github("Bolin-Wu/workshopr", subdir = "rpackage", force = TRUE)
# load the package
library(workshopr)
library(tidyverse)
library(here)
Introduction
This document is share useful data manipulation skills at daily epidemiology work. My main goal is to follow the “dont’t repeat yourself” (DRY) principle. It can make our code more readable and reduce our chance of making mistakes.
The content includes:
- transform data shape
- get the label from DTA and SPSS in R
- assign function
- apply, lapply, sapply
- filter variables based on name pattern
- check missing values
- bin variables by percentiles
Transform data shape
Transform data shape for many people, including me, sounds troublesome. Within the R world, its relevant functions are evolving overtime as well.
In the beginning, I used spread and
gather.
Every time I use spread and gather, it takes
me a while to figure out how to fill in ‘key’ and ‘value’. But as you
can see from their documentation, their ‘lifecycle’ is ‘superseded’.
Now I only use pivot_longer and pivot_wider
for transforming data. You can find their comprehensive documentation here. They
come with better documentation, more powerful application, and better
integration with tidyverse syntax.
Example
Let’s assumme we received a wide format data:
head(fake_snack_df)
The column names are:
sort(colnames(fake_snack_df))
## [1] "Date_wave1" "Date_wave2" "Date_wave3" "Date_wave4"
## [5] "Date_wave5" "Date_wave6" "dementia_wave1" "dementia_wave2"
## [9] "dementia_wave3" "dementia_wave4" "dementia_wave5" "dementia_wave6"
## [13] "education" "Lopnr" "sex"
Now, assume for some reason, e.g. merge it with other data set, we want to transform it in a long format. There are two variables should be formated: ‘Date’ and ‘dementia’. For beginners, I would recommend to start small.
- Select the interested columns
fake_snack_df %>%
select(contains("Date"))
- Read documentation, try to fill in the arguments. The 3 basic
arguments are:
colstells R what variables to pivot.names_todefines a new name for columns incols.values_togives a new name for values under the columns incols.
Let’s give a first try:
# check pivot_longer() documentation in R: ?pivot_longer()
fake_snack_df %>%
select(Lopnr,contains("Date")) %>%
pivot_longer(cols = contains("Date"),
names_to = "wave", names_prefix = "Date",
values_to = "date")
The result above looks good, but ‘wave’ looks a bit strange. I will leave the task to audience to fix this column.
- Do the same with ‘dementia’ columns
fake_snack_df %>%
select(Lopnr,contains("dementia")) %>%
pivot_longer(cols = contains("dementia"),
names_to = "wave", names_prefix = "dementia",
values_to = "dementia")
Task
- Change the arguments in the pivot_longer function to get proper wave
column. Result example:
- Merge the two long pivot dataset together.
- Optional: Pivot the original ‘fake_snack_df’ by using only one
pivot_longerfunction. Put wave and date after ‘Lopnr’ column. Hint: check thenames_toargument andrelocate()function. Result example: